Abstract



In this paper, I discuss machine transla-
tion of English text into a relatively "free"
word order language, specifically Turkish. I 
present algorithms that use contextual in- 
formation to determine what the topic and 
the focus of each sentence should be, in or- 
der to generate the contextually appropri- 
ate word orders in the target language. 

1 Introduction 

Languages such as Catalan, Czech, Finnish, Ger- 
man, Hindi, Hungarian, Japanese, Polish, Russian, 
Turkish, etc. have much freer word order than En- 
glish. For example, all six permutations of a transi- 
tive sentence are grammatical in Turkish (although 
SOV is the most common). When we translate an 
English text into a "free" word order language, we 
are faced with a choice between many different word 
orders that are all syntactically grammatical but are 
not all felicitous or contextually appropriate. In this 
paper, I discuss machine translation (MT) of En- 
glish text into Turkish and concentrate on how to 
generate the appropriate word order in the target 
language based on contextual information. 

The most comprehensive project of this type is
presented in (Stys/Zemke, 1995) for MT into Polish.
They use the referential form and repeated
mention of items in the English text in order to 
predict the salience of discourse entities and order 
the Polish sentence according to this salience rank- 
ing. They also rely on statistical data, choosing the 
most frequently used word orders. I argue for a 
more generative approach: a particular information 
structure (IS) can be determined from the contex- 
tual information and then can be used to generate 
the felicitous word order. This paper concentrates 
on how to determine the IS from contextual information
using centering, old vs. new information, and
contrastiveness. (Hajicova/etal, 1993; Steinberger,
1994) present approaches that determine the IS by
using cues such as word order, definiteness, and
complement semantic types (e.g. temporal adjuncts vs.
arguments) in the source language, English. I be- 
lieve that we cannot rely upon cues in the source 
language in order to determine the IS of the trans- 
lated text. Instead, I use contextual information in 
the target language to determine the IS of sentences 
in the target language. 

In section 2, I discuss the Information Structure, 
and specifically the topic and the focus in naturally 
occurring Turkish data. Then, in section 3, I present 
algorithms for determining the topic and the focus, 
and show that we can generate contextually appro- 
priate word orders in Turkish using these algorithms 
in a simple MT implementation. 

2 Information Structure 

In the Information Structure (IS) that I use for Turk- 
ish, a sentence is first divided into a topic and a com- 
ment. The topic is the main element that the sen- 
tence is about, and the comment is the information 
conveyed about this topic. Within the comment, we 
find the focus, the most information-bearing con- 
stituent in the sentence, and the ground, the rest of 
the sentence. The focus is the new or important 
information in the sentence and receives prosodic 
prominence in speech. 

In Turkish, the pragmatic function of topic is assigned
to the sentence-initial position and the focus
to the immediately preverbal position, following
(Erguvanli, 1984). The rest of the sentence forms the
ground.



In (Hoffman, 1995; Hoffman, 1995b), I show that
the information structure components of topic and
focus can be successfully used in generating the
context-appropriate answer to database queries.
Determining the topic and focus is fairly easy in the
context of a simple question; however, it is much
more complicated in a text. In the following sections,
I will describe the characteristics of the topic, focus,
and ground components of the IS in naturally
occurring texts analyzed in (Hoffman, 1995b) and
allude to possible algorithms for determining them.
The algorithms will then be spelled out in section 3.

Figure 1: The Cb in SOV and OSV Sentences (one table for the Cb in SOV sentences and one for the Cb in OSV sentences; rows include Cb = Subject and Cb = Object).

An example text from the corpus^1 is shown below.
The noncanonical OSV word order in (1)b
is contextually appropriate because the object pronoun
is a discourse-old topic that links the sentence
to the previous context, and the subject, "your father",
is a discourse-new focus that is being contrasted
with other relatives. Discourse-old entities
are those that were previously mentioned in the discourse
while discourse-new entities are those that
were not (Prince, 1992).

(1) a. Bu defteri de çok sevdim ben.
       This notebook-Acc too much like-Pst-1S I.
       'As for this notebook, I like it very much.'

    b. Bunu da baban mi verdi? (OSV)
       This-Acc too father-2S Quest give-Past?
       'Did your FATHER give this to you?'
       (CHILDES Iba.cha)

Many people have suggested that "free" word or- 
der languages order information from old to new in- 
formation. However, the Old-to-New ordering prin- 
ciple is a generalization to which exceptions can be 
found. I believe that the order in which speakers 
place old vs. new items in a sentence reflects the in- 
formation structures that are available to the speak- 
ers. The ordering is actually the Topic followed by 
the Focus. The Topic tends to be discourse-old in- 
formation and the focus discourse-new. However, 
it is possible to have a discourse-NEW topic and a 
discourse-OLD focus, as we will see in the following 
sections, which explains the exceptions to the
Old-to-New ordering principle.



2.1 Topic 

Although humans can intuitively determine what 
the topic of a sentence is, the traditional defini- 
tion (what the sentence is about) is too vague to 
be implemented in a computational system. I pro- 
pose heuristics based on familiarity and salience to 
determine discourse-old sentence topics, and heuris- 
tics based on grammatical relations for discourse- 
new topics. Speakers can shift to a new topic at 
the start of a new discourse segment, as in (2)a. Or 
they can continue talking about the same discourse- 
old topic, as in (2)b. 

(2) a. [Mary]T went to the bookstore.

    b. [She]T bought a new book on linguistics.

A discourse-old topic often serves to link the sentence
to the previous context by evoking a familiar
and salient discourse entity. Centering Theory
(Grosz/etal, 1995) provides a measure of saliency
based on the observations that salient discourse entities
are often mentioned repeatedly within a discourse
segment and are often realized as pronouns.
(Turan, 1995) provides a comprehensive study of
null and overt subjects in Turkish using Centering
Theory, and I investigate the interaction between
word order and Centering in Turkish in (Hoffman,
1996).



^1 The data was collected from transcribed conversations,
contemporary novels, and adult speech from the
CHILDES corpus.



In the Centering Algorithm, each utterance in a 
discourse is associated with a ranked list of discourse 
entities called the forward-looking centers (Cf list) 
that contains every discourse entity that is realized 
in that utterance. The Cf list is usually ranked ac- 
cording to a hierarchy of grammatical relations, e.g. 
subjects are assumed to be more salient than objects.
The backward-looking center (Cb) is the most
salient member of the Cf list that links the current 
utterance to the previous utterance. The Cb of an 
utterance is defined as the highest ranked element of 
the previous utterance's Cf list that also occurs in 
the current utterance. If there is a pronoun in the 
sentence, it is likely to be the Cb. As we will see, 
the Cb has much in common with a sentence-topic. 
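
The definition of the Cb given above can be sketched in a few lines of Python. This is only an illustration, assuming that each Cf list is already available as a list of entity names ordered from most to least salient; the function name `backward_looking_center` and the toy Cf lists are hypothetical and are not taken from the implementation described in this paper.

```python
from typing import List, Optional


def backward_looking_center(prev_cf: List[str], cur_cf: List[str]) -> Optional[str]:
    """Return the highest-ranked entity of the previous utterance's Cf list
    that is also realized in the current utterance, or None if there is none."""
    realized = set(cur_cf)
    for entity in prev_cf:  # prev_cf is ordered from most to least salient
        if entity in realized:
            return entity
    return None


# Toy Cf lists (assumed) for "Mary went to the bookstore." followed by
# "She bought a new book on linguistics."
prev_cf = ["Mary", "bookstore"]
cur_cf = ["Mary", "book"]
print(backward_looking_center(prev_cf, cur_cf))  # Mary
```

When no entity is shared between the two utterances, the function returns `None`, which corresponds to the case where no Cb can be found.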

Figure 2: Given/New Status in Different Sentence Positions



The Cb analyses of the canonical SOV and the
noncanonical OSV word orders in Turkish are summarized
in Figure 1 (forthcoming study in (Hoffman,
1996)). As expected, the subject is often the Cb in
the SOV sentences. However, in the OSV sentences,
the object, not the subject, is most often the Cb of
the utterance. A comparison of the 20 discourses in
the first two rows^2 of the tables in Figure 1 using the
chi-square test shows that the association between
sentence-position and Cb is statistically significant
(χ² = 10.10, p < 0.001).^3 Thus, the Cb, when it is
not dropped, is often placed in the sentence-initial
topic position in Turkish regardless of whether it is
the subject or the object of the sentence. The intuitive
reason for this is that speakers want to form a
coherent discourse by immediately linking each sentence
to the previous ones by placing the Cb and
discourse-old topic in the sentence-initial position.

There are also situations where no Cb or 
discourse-old topic can be found. Then, a discourse- 
new topic can be placed in the sentence-initial po- 
sition to start a new discourse segment. Discourse- 
new topics are often subjects or situation-setting ad- 
verbs (e.g. yesterday, in the morning, in the garden) 
in Turkish. 

2.2 Focus 

The term focus has been used with many different 
meanings. Focusing is often associated with new in- 
formation, but it is well-known that old informa- 
tion, for example pronouns, can be focused as well. 
I think part of the confusion lies in the distinction 
between contrastive and presentational focus. Fo- 
cusing discourse-new information is often called pre- 
sentational or informational focus as shown in (3)a. 



^2 The centering analysis is inconclusive in some cases
because the subject and the object in the sentence are
realized with the same referential form (e.g. both as
overt pronouns or as full NPs).

^3 Alternatively, using the canonical SOV sentences as
the expected frequencies, the observed frequencies for the
noncanonical OSV sentences significantly diverge from
the expected frequencies (χ² = 8.8, p < 0.005).



Broad/wide focus (focus projection) is also possible
where the rightmost element in the phrase is accented,
but the whole phrase is in focus. However,
we can also use focusing in order to contrast one
item with another, and in this case the focus can be
discourse-old or discourse-new, e.g. (3)b.

(3) a. What did Mary do this summer?
       She [wandered around TURKEY]F.
    b. It wasn't [ME]F - It was [HER]F.

(Vallduvi, 1992) defines focus as the most
information-bearing constituent, and this definition
encompasses both contrastive and presentational focusing.
I use this definition of focus as well. However,
as we will see, we still need two different algorithms
in order to determine which items are in focus
in the target sentence in MT. We must check
whether they are discourse-new information as well as
whether they are being contrasted with another
item in the discourse model.

In Turkish, items that are presentationally or contrastively
focused are placed in the immediately preverbal
(IPV) position and receive the primary accent
of the phrase.^4 As seen in Figure 2, brand-new
discourse entities are found in the IPV position,
but never in other positions in the sentence in
my Turkish corpus. The distribution of brand-new
(the starred line of the table) versus discourse-old
information (the rest of the table^5) is statistically
significant (χ² = 10.847, p < .001). This supports
the association of discourse-new focus with the IPV
position.

However, as can be seen in Figure 2, most of the
focused subjects in the OSV sentences in my corpus
were actually discourse-old information. Discourse-old
entities that occur in the IPV position are contrastively
focused. In (Rooth, 1985)'s alternative-set
theory, a contrastively focused item is interpreted by
constructing a set of alternatives from which the focused
item must be distinguished. Generalizing from
his work, we can determine whether an entity should
be contrastively focused by seeing if we can construct
an alternative set from the discourse model.

^4 Some languages such as Greek and Russian treat
presentational and contrastive focus differently in word
order.

^5 Inferrables refer to entities that the hearer can easily
accommodate based on entities already in the discourse
model or the situation. Hearer-old entities are
well-known to the speaker and hearer but not necessarily
mentioned in the prior discourse (Prince, 1992). They
both behave like discourse-old entities.

Those items that do not play a role in the IS of the
sentence as the topic or the focus form the ground of
the sentence. In Turkish, discourse-old information
that is not the topic or focus can be

(4) a. dropped,
    b. postposed to the right of the verb,
    c. or placed unstressed between the topic and
       the focus.

Postposing plays a backgrounding function in Turk- 
ish, and it is very common. Often, speakers will drop 
only those items that are very salient (e.g. men- 
tioned just in the previous sentence) and postpose 
the rest of the discourse-old items. However, the 
conditions for dropping arguments can be very complex.
(Turan, 1995) shows that there are semantic
considerations; for instance, generic objects are of- 
ten dropped, but specific objects are often realized 
as overt pronouns and fronted. Thus, the conditions 
governing dropping and postposing are areas that 
require more research. 

3 The Implementation 

In order to simplify the MT implementation, I con- 
centrate on translating short and simple English 
texts into Turkish, using an interlingua
representation where concepts in the semantic representation
map onto at most one word in the English or Turkish 
lexicons. The translation proceeds sentence by sen- 
tence (leaving aside questions of aggregation, etc.), 
but contextual information is used during the incre- 
mental generation of the target text. These sim- 
plifications allow me to test out the algorithms for 
determining the topic and the focus presented in this 
section. 

In the implementation, first, an English sentence
is parsed with a Combinatory Categorial Grammar,
CCG (Steedman, 1985). The semantic representation
is then sent to the sentence planner for Turkish.
The sentence planner uses the algorithms in
the following subsections to determine the topic, focus,
and ground from the given semantic representation
and the discourse model. Then, the sentence
planner sends the semantic representation and the
information structure it has determined to the sentence
realization component for Turkish. This component
consists of a head-driven bottom-up generation
algorithm that uses the semantic as well as
the information structure features given by the planner
to choose an appropriate head in the lexicon.
The grammar used for the generation of Turkish
is a lexicalist formalism called Multiset-CCG (Hoffman,
1995; Hoffman, 1995b), an extension of CCGs.
Multiset-CCG was developed in order to capture
formal and descriptive properties of "free" and restricted
word order in simple and complex sentences
(with discontinuous constituents and long distance
dependencies). Multiset-CCG captures the context-dependent
meaning of word order in Turkish by compositionally
deriving the predicate-argument structure
and the information structure of a sentence in
parallel.

The following sections describe the algorithms 
used by the sentence planner to determine the IS 
of the Turkish sentence, given the semantic repre- 
sentation of a parsed English sentence. 

3.1 The Topic Algorithm 

As each sentence is translated, we update the discourse
model, and keep track of the forward-looking
centers list (Cf list) of the last processed sentence.
This is simply a list of all the discourse
entities realized in that sentence ranked according
to the theta-role hierarchy found in the semantic
representation. Thus, the Cf list for the representation
give(Pat, Chris, book) is the ranked list
[Pat, Chris, book], where the subject is assumed to
be more salient than the objects.

Given the semantic representation for the sen- 
tence, the discourse model of the text processed 
so far, and the ranked Cf lists of the current and 
previous sentences in the discourse, the following 
algorithm determines the topic of the sentence. 
First, the algorithm tries to choose the most salient 
discourse-old entity as the sentence topic.^6 If there is
no discourse-old entity realized in the sentence, then 
a situation-setting adverb or the subject is chosen as 
the discourse-new topic. 

^6 (Stys/Zemke, 1995) use the saliency ranking to order
the whole sentence in Polish. However, I believe that
there is a distinct notion of topic and focus in Turkish.

1. Compare the current Cf list with the previous
sentence's Cf list and choose the first item that
is a member of both of the ranked lists (the Cb).

2. If 1 fails: Choose the first item in the current
sentence's Cf list that is discourse-old (i.e. is
already in the discourse model).

3. If 2 fails: If there is a situation-setting adverb
in the semantic representation (i.e. a predicate
modifying the main event in the representation),
choose it as the discourse-new topic.

4. If 3 fails: choose the first item in the Cf list (i.e.
the subject) as the discourse-new topic.
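
The four steps above can be sketched as a small Python function. This is a minimal illustration under stated assumptions rather than the actual sentence planner: entities are plain strings, the discourse model is a set of entity names, the situation-setting adverb (if any) is passed in separately, and step 1 is read as in Centering Theory, i.e. the previous Cf list is scanned in rank order. The name `choose_topic` and the toy Cf lists for the text in (5) are hypothetical.

```python
from typing import List, Optional, Set, Tuple


def choose_topic(cur_cf: List[str],
                 prev_cf: List[str],
                 discourse_model: Set[str],
                 adverb: Optional[str] = None) -> Tuple[str, int]:
    """Return (topic, step), where step is the rule (1-4) that applied.
    Assumes cur_cf is non-empty and ranked from most to least salient."""
    # Step 1: the Cb, i.e. the highest-ranked entity of the previous
    # Cf list that is also realized in the current sentence.
    for entity in prev_cf:
        if entity in cur_cf:
            return entity, 1
    # Step 2: the first discourse-old entity in the current Cf list.
    for entity in cur_cf:
        if entity in discourse_model:
            return entity, 2
    # Step 3: a situation-setting adverb becomes the discourse-new topic.
    if adverb is not None:
        return adverb, 3
    # Step 4: fall back to the first item of the Cf list (the subject).
    return cur_cf[0], 4


# Assumed Cf lists and discourse model for sentences (5)a, (5)c and (5)d.
print(choose_topic(["Pat", "Chris"], [], set(), adverb="today"))
model = {"Pat", "Chris", "today", "talk", "four"}
print(choose_topic(["Chris", "talk"], ["talk"], model))
print(choose_topic(["Pat"], ["Chris", "talk"], model))
```

Under these assumptions the three calls select `today` by step 3, `talk` by step 1, and `Pat` by step 2, which agrees with the T annotations given for (6)a, (6)c and (6)d.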

Note that the determination of the sentence topic 
is distinct from the question of how to realize the 
salient Cb/topic (e.g. as a dropped or overt pronoun
or full NP). In the MT domain, this can be
determined by the referential form in the source 
text. This trick can also be used for accommodat- 
ing inferrable or hearer-old entities that behave as if 
they are discourse-old even though they are literally 
discourse-new. If an item that is not in the discourse 
model is nonetheless realized as a definite NP in 
the source text, the speaker is treating the entity as 
discourse-old. This is very similar to (Stys/Zemke,
1995)'s MT system which uses the referential form in
the source text to predict the topicality of a phrase 
in the target text. 

3.2 The Focus Algorithm 

Given the rest of the semantic representation for the 
sentence and the discourse model of the text pro- 
cessed so far, the following algorithm determines the 
focus of the sentence. The first step is to deter- 
mine presentational focusing of discourse-new infor- 
mation. Note that the focus, unlike the topic, can 
contain more than one element; this allows broad 
focus as well as narrow focusing. If there is no 
discourse-new information, the second step in the al- 
gorithm allows contrastive focusing of discourse-old 
information. In order to construct the alternative 
sets, a small knowledge base is used to determine 
the semantic type (agent, object, or event) of the 
entities in the discourse model. 

1. If there are any discourse-new entities (i.e. not 
in the discourse model) in the sentence, put 
their semantic representations into focus. 

2. Else for each discourse entity realized in the sen- 
tence, 

(a) Look up its semantic type in the KB and 
construct an alternative set that consists 
of all objects of that type in the discourse 
model, 

(b) If the constructed alternative set is not 
empty, put the discourse entity's semantic 
representation into the focus. 
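
The two steps of the focus algorithm can be sketched in Python as follows. This is an illustration only, with several assumptions that go beyond the text: entities are plain strings, the knowledge base is a dictionary from entity names to semantic types, and an entity is excluded from its own alternative set (otherwise the set would never be empty for a discourse-old entity). The name `choose_focus` and the toy knowledge base are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def choose_focus(entities: List[str],
                 discourse_model: Set[str],
                 kb_types: Dict[str, str]) -> Tuple[List[str], int]:
    """Return (focus, step) for the non-topic entities of a sentence."""
    # Step 1: presentational focus on all discourse-new entities.
    new = [e for e in entities if e not in discourse_model]
    if new:
        return new, 1
    # Step 2: contrastive focus on discourse-old entities that have
    # alternatives of the same semantic type in the discourse model.
    focus = []
    for e in entities:
        etype = kb_types.get(e)
        if etype is None:
            continue  # no type information, so no alternative set
        alternatives = {x for x in discourse_model
                        if x != e and kb_types.get(x) == etype}
        if alternatives:
            focus.append(e)
    return focus, 2


# Assumed semantic types for the entities in text (5).
kb = {"Pat": "agent", "Chris": "agent", "talk": "object"}
print(choose_focus(["Pat", "Chris"], set(), kb))
print(choose_focus(["Chris"], {"Pat", "Chris", "talk"}, kb))
```

In the second call, `Chris` is discourse-old but can be contrasted with the other agent in the discourse model, `Pat`, so it is placed in focus by step 2, as in (6)c.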



Once the topic and focus are determined, the re- 
mainder of the semantic representation is assigned 
as the ground. For now, items in the ground are ei- 
ther generated in between the topic and the focus or 
post-posed behind the verb as backgrounded infor- 
mation. Further research is needed to disambiguate 
the use of the two possible word orders. 
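
The resulting ordering of the IS components can be illustrated with a short Python sketch. It is a deliberate simplification and not the Multiset-CCG generator used in the implementation: it only places the topic sentence-initially, the focus immediately before the verb, and the ground either between them or after the verb. The function `linearize`, its `postpose_ground` flag, and the use of `today` as a ground item are hypothetical, and English labels stand in for the Turkish words.

```python
from typing import List, Optional


def linearize(topic: Optional[str], focus: List[str], ground: List[str],
              verb: str, postpose_ground: bool = False) -> List[str]:
    """Order constituents as: topic, (ground), focus, verb, (ground)."""
    order = [topic] if topic is not None else []
    if not postpose_ground:
        order += ground   # unstressed, between the topic and the focus
    order += focus        # immediately preverbal (IPV) position
    order.append(verb)
    if postpose_ground:
        order += ground   # backgrounded, to the right of the verb
    return order


# (6)c: topic = the talk, focus = Chris, empty ground gives OSV order.
print(linearize("talk", ["Chris"], [], "give"))
# The same sentence with a hypothetical ground item in both positions.
print(linearize("talk", ["Chris"], ["today"], "give"))
print(linearize("talk", ["Chris"], ["today"], "give", postpose_ground=True))
```

The choice between the two ground positions is left as a flag here because, as noted above, the conditions that distinguish them are not yet understood.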

Further research is also needed on the exact role 
of verbs in the IS. Verbs can be in the focus or the 
ground in Turkish; this cannot be seen in the word 
order, but it is distinguished by sentential stress 
for narrow focus readings. The algorithm above 
works for verbs since I place events that are realized 
as verbs in translated sentences into the discourse 
model as discourse-old information. However, verbs 
are usually not in focus unless they are surprising or 
contrastive or in a discourse-initial context. Thus, 
the algorithm needs to be extended to accommodate 
discourse-new verbs that are nonetheless expected in 
some way into the ground component. In addition, 
verbs often participate in broad focus readings, and 
further research is needed to account for the obser- 
vation that broad focus readings are only available 
in canonical word orders. 

3.3 Examples 

The English text in (5) is translated using the word 
orders in (6) following the algorithms given above. 
In (6), the numbers following T and F indicate the 
step in the respective algorithm which determined 
the topic or focus for that sentence. Note that the 
inappropriate word orders (indicated by #) cannot 
be generated by the algorithm. 

(5) a. Pat will meet Chris today. 

b. There is a talk at four. 

c. Chris is giving the talk. 

d. Pat cannot come. 

(6) a. Bugün Pat Chris'le buluşacak. (AdvSOV)
       Today Pat Chris-with meet-Fut. (T:3,F:1)

    b. Dortde bir konuşma var. (AdvSV,#SAdvV)
       Four-Loc one talk exist. (T:3,F:1)

    c. Konuşmayi Chris veriyor. (OSV,#SOV)
       Talk-Acc Chris give-Prog. (T:1,F:2)

    d. Pat gelemiyecek. (SV,#VS)
       Pat come-Neg-Fut. (T:2,F:1 for the verb)

The algorithms can also utilize long distance 
scrambling in Turkish, i.e. constructions where an
element of an embedded clause has been extracted
and scrambled into the matrix clause in order to play 
a role in the IS of the matrix clause. For example 
the b sentence in the following text is translated us- 
ing long distance scrambling because "the talk" is 
the Cb of the utterance and therefore the best sen- 
tence topic, even though it is the argument of an 
embedded clause. 

(7) a. There is a talk at four. 

b. Pat thinks that Chris will give the talk. 

(8) a. Dortde bir konuşma var. (AdvSV)
       Four-Loc one talk exist.

    b. Konuşmayi_i Pat [Chris'in e_i verecegini] saniyor. (O2 S1 [S2 V2] V1)
       Talk-Acc_i Pat [Chris-Gen e_i give-Ger-3S-Acc] think-Prog. (T:1,F:1)

4 Conclusions 

In the machine translation task from English into a
"free" word order language, it is crucial to choose
the contextually appropriate word order in the tar- 
get language. In this paper, I discussed how to de- 
termine the appropriate word order using contextual 
information in translating into Turkish. I presented 
algorithms for determining the topic and the focus 
of the sentence. These algorithms are sensitive to 
whether the information is old or new in the dis- 
course model (incrementally constructed from the 
translated text); whether they refer to salient en- 
tities (using Centering Theory); and whether they 
can be contrasted with other entities in the discourse 
model. Once the information structure for a seman- 
tic representation is constructed using these algo- 
rithms, the sentence with the contextually appropri- 
ate word order is generated in the target language 
using Multiset-CCG, a grammar which integrates
syntax and information structure. 